Abstract

We describe experiments with a versatile pictorial prototype based learning scheme for 
3D object recognition. The GRBF scheme seems to be amenable to realization in biophysical 
hardware because the only kind of computation it involves can be effectively carried out by 
combining receptive fields. Furthermore, the scheme is computationally attractive because 
it brings together the old notion of a “grandmother” cell and the rigorous approximation 
methods of regularization and splines. 
Figure 1: Application of a general module for multivariate function approximation to the
problem of recognizing a 3D object from any of its perspective views. In (a), the module 
is trained to produce the vector representing the standard view of the object, given a set of 
examples of random perspective views of the same object. The module is also capable of 
recovering the viewpoint coordinates $\theta, \phi$ (the latitude and the longitude of the observer on an
imaginary sphere centered at the object) that correspond to the training views. When given a 
new random view of the same object (b), the module recognizes it by producing the standard 
view. Other objects are rejected by thresholding the euclidean distance between the actual 
output of the model and the standard view. 

1 Introduction 

An intelligent visual system is expected to be able to retain representations of objects it encounters
and to recognize these objects later, under potentially different viewing conditions. This
requires the solution of at least three difficult problems. The first problem is the variability 
of object appearance due to changing illumination, which may be addressed by working with 
relatively stable features, such as intensity edges [1] (preferably, in conjunction with cues from 
visual motion and stereo [2]), rather than with raw intensity images. The second problem, the 
removal of the variability due to unknown pose of the object, may be solved by first hypothesizing
the viewpoint (e.g., using information on feature correspondences between the image and a
model), then computing the appearance of the model of the object to be recognized from that 
viewpoint and comparing it with the actual image [3, 4, 5, 6]. Generally, recognition schemes of 
this type employ 3D models of objects. Automatic learning of 3D models is the third difficult 
problem faced by state-of-the-art recognition schemes. Few of these schemes learn to recognize 
objects from examples and most use 3D models acquired through user interaction (see, e.g., 
[6]) or through active sensing (e.g., range data; [7, 8]). 

In this paper, we describe an implemented scheme for recognizing wire-frame objects that 
addresses two of the three aspects of the recognition problem mentioned above: learning object 
representations and generalizing recognition to novel viewpoints. We base our approach on a 
recently proposed network scheme for the approximation of multivariate functions, by couching
the problem in terms of the synthesis of a module that generates a representation of an object 
(e.g., produces a “standard” view) given any of its perspective views (Figure 1). 
2 Theoretical basis

2.1 How much information is necessary for learning 3D structure? 

Structure from motion theorems [9,10], pioneered by Ullman [11], indicate that full information 
about the 3D structure of an object represented as a set of feature points (at least five to eight) is 
present in just two of their perspective views, provided that corresponding points are identified 
in each view. A view is represented as a 2N-vector $(x_1, y_1, x_2, y_2, \ldots, x_N, y_N)$ of the coordinates
on the image plane of N labeled and visible feature points on the object. Here and in most 
of the following we assume that all features are visible, as they are in wire-frame objects. The 
generalization to opaque objects follows by partitioning the viewpoint space for each object 
into a set of “aspects” [12], corresponding to stable clusters of visible features. In principle, 
therefore, having enough 2D views of an object is equivalent to having its 3D structure specified. 

2.2 Learning as hypersurface interpolation 

This line of reasoning, together with properties of perspective projection, suggests (a) that for each
object there exists a smooth function mapping any perspective view into a “standard” view of
the object and (b) that this multivariate function may be synthesized, or at least approximated, 
from a small number of views of the object. Such a function would be object specific, with 
different functions corresponding to different 3D objects. Furthermore, the application of the 
function that is specific for one object to the views of a different object is expected to result in 
a “wrong” standard view that can be easily detected as such. 

Synthesizing an approximation to a function from a small number of sparse data (the views)
can be considered as learning an input-output mapping from a set of examples [13, 14]. A
powerful scheme for the approximation of smooth functions has been recently proposed under 
the name of Generalized Radial Basis Functions (GRBFs) and shown [13, 14] to be equivalent 
to standard regularization [15, 16] and generalized splines ([13]; see closely related work by 
Powell [17] and Broomhead and Lowe [18]). The approximation of $f : R^n \to R$ is given by

$$f(x) = \sum_{\alpha=1}^{K} c_\alpha G(\|x - t_\alpha\|) \qquad (1)$$

where the coefficients $c_\alpha$ and the centers $t_\alpha$ are found during the learning stage and G is an
appropriate basis function (see [13,14]), such as the Gaussian. If the function f is vector-valued,
each component $f_i$ is computed using eq. 1 with the appropriate $c_{i\alpha}$, in which case the equation
is precisely equivalent to the network of Figure 2. The function f(x) in equation 1 minimizes
the error functional


$$H[f] = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \|Pf\|^2 \qquad (2)$$

on the set of examples. In equation 2, P is usually a differential operator and $\lambda$ is a positive
real number, called the regularization parameter [15]. The radial function G is fully determined 
by the stabilizer P in eq. 2 and therefore by the prior assumptions on the function to be 
approximated, such as its degree of smoothness [13]. P also determines whether a polynomial 
term of the form $d_j p_j(x)$ should be added to the right-hand side of eq. 1. In most of the
experiments described in section 3.6 we omitted the polynomial term and used the Gaussian as
the radial basis function. The optimal width $\sigma$ of the Gaussian RBFs can be found, along with
$c_\alpha$ and $t_\alpha$, by minimizing H in equation 2.
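
To make equation 1 concrete, the following is a minimal sketch of how a vector-valued GRBF module with Gaussian basis functions could be evaluated and then used for recognition by thresholding. The function names, the shared width `sigma`, and the particular Gaussian form exp(-r^2 / (2 sigma^2)) are assumptions made for illustration; the text above does not fix the normalization of the Gaussian.

```python
import math

def gaussian(r, sigma):
    """Gaussian radial basis function of the distance r (assumed form)."""
    return math.exp(-(r * r) / (2.0 * sigma * sigma))

def grbf_output(x, centers, coeffs, sigma):
    """Evaluate equation 1 for a vector-valued f.

    x       -- input view, a list of 2N image coordinates
    centers -- K centers t_alpha, each a list of length 2N
    coeffs  -- coeffs[a][i] weights basis unit a for output component i
    sigma   -- width of the Gaussian basis functions (shared by all units)
    """
    activities = [gaussian(math.dist(x, t), sigma) for t in centers]
    n_out = len(coeffs[0])
    return [sum(activities[a] * coeffs[a][i] for a in range(len(centers)))
            for i in range(n_out)]

def is_target(output, standard_view, threshold):
    """Accept a view when the output is close enough to the standard view."""
    return math.dist(output, standard_view) <= threshold
```

The threshold is a free parameter of this sketch; section 3.4 describes how the separation between target and non-target errors was measured.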

In a special simple case, there are as many basis functions (K) as views in the training set
(M; in general, K < M). The centers of the radial functions are then fixed and are identical
with the training views. Each basis unit in the “hidden” layer computes the distance of the new
view from its center and applies to it the radial function. The resulting value, $G(\|x - t_\alpha\|)$, can
be regarded as the “activity” of the unit. If G is Gaussian, a basis unit will attain maximum
activity when the input exactly matches its center. The output of the network is a linear
superposition of the activities of all the basis units in the network.

Figure 2b illustrates the special case of Gaussian basis functions. A multidimensional Gaussian
can be synthesized as the product of two-dimensional Gaussian receptive fields operating
on retinotopic maps of features. The solid circles in the image plane represent the 2D Gaussians
associated with the first radial basis function, which corresponds to the first view of the
object. The dotted circles represent the 2D receptive fields that synthesize the Gaussian radial
function associated with another view. The Gaussian receptive fields transduce positions of
features, represented implicitly as activity in a retinotopic array, and their product “computes”
the radial function without the need of calculating norms and exponentials explicitly. 1
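
The factorization rests on the identity exp(-sum_i d_i^2 / (2 sigma^2)) = prod_i exp(-d_i^2 / (2 sigma^2)), where d_i is the image-plane distance between the i-th feature and the i-th receptive field center. The sketch below checks this numerically; the function names and the Gaussian normalization are illustrative assumptions.

```python
import math

def gaussian_2d(p, q, sigma):
    """Response of a 2D Gaussian receptive field centered at q to a feature at p."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

def unit_activity_product(view, center, sigma):
    """Activity of one basis unit as a product of N receptive-field responses."""
    activity = 1.0
    for p, q in zip(view, center):
        activity *= gaussian_2d(p, q, sigma)
    return activity

def unit_activity_norm(view, center, sigma):
    """The same activity computed directly from the 2N-dimensional norm."""
    sq = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
             for p, q in zip(view, center))
    return math.exp(-sq / (2.0 * sigma * sigma))
```

Here a view is a list of N (x, y) pairs; the two functions agree up to floating-point rounding.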

The weights C are found during learning by minimizing a measure of the error between the
network’s prediction and the desired output for each of the examples. Computationally, this
amounts to inverting a matrix (when M > K, the generalized inverse is computed instead) and
is equivalent to finding an optimal generalized spline approximation (interpolation when $\lambda = 0$
in equation 2) with fixed knots.
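
As an illustration of the fixed-center case, the sketch below finds the coefficients by solving the normal equations G^T G c = G^T y, where G[i][a] is the activity of unit a for training view i. This is one assumed way of obtaining the least-squares solution; it coincides with the generalized inverse only when G has full column rank, and it uses no regularization term. All names are hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_coefficients(views, targets, centers, sigma):
    """Least-squares coefficients for fixed centers (no regularization).

    Returns coeffs[a][j] for basis unit a and output component j.
    """
    K, M_views = len(centers), len(views)
    G = [[math.exp(-math.dist(x, t) ** 2 / (2.0 * sigma ** 2)) for t in centers]
         for x in views]
    GtG = [[sum(G[i][a] * G[i][b] for i in range(M_views)) for b in range(K)]
           for a in range(K)]
    n_out = len(targets[0])
    columns = []
    for j in range(n_out):
        Gty = [sum(G[i][a] * targets[i][j] for i in range(M_views))
               for a in range(K)]
        columns.append(solve(GtG, Gty))
    return [[columns[j][a] for j in range(n_out)] for a in range(K)]
```

With K = M and the centers placed on the training views, the fitted module reproduces the standard view exactly at every training view, which is the interpolation case mentioned above.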

If the centers of the basis functions are allowed to move (which may be desirable, e.g., when
the number of basis functions is less than the number of views in the training set), the scheme
becomes equivalent to a spline with free knots. The centers may be updated during learning
by a gradient descent minimizing the approximation error expressed by equation 2. A further
generalization may be achieved by using a weighted norm in equation 1:

$$\|x - t_\alpha\|_W^2 = (x - t_\alpha)^T W^T W (x - t_\alpha) \qquad (3)$$

Updating the centers is equivalent to modifying the corresponding “prototypical views” and
corresponds to task-dependent clustering. Finding the optimal weights for the norm is equivalent
to a transformation of the input coordinate space and corresponds to task-dependent
dimensionality reduction. A more detailed description of the GRBF approximation technique,
of its theoretical motivation and of its relation to other techniques such as backpropagation [20]
can be found in [13, 14].
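
The weighted norm of equation 3 is simply the squared Euclidean length of W(x - t). A minimal sketch follows (the function name is hypothetical); a W with fewer rows than columns projects the input onto a lower-dimensional space, which is one way to read the remark about dimensionality reduction.

```python
def weighted_sq_norm(x, t, W):
    """Squared weighted distance (x - t)^T W^T W (x - t) = ||W (x - t)||^2.

    W is a list of rows; it need not be square.
    """
    d = [xi - ti for xi, ti in zip(x, t)]
    Wd = [sum(w * di for w, di in zip(row, d)) for row in W]
    return sum(v * v for v in Wd)
```

With W equal to the identity matrix this reduces to the ordinary squared distance used in equation 1.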

3 Implementation and performance 

We have conducted an empirical investigation of the applicability of GRBFs, under a variety 
of conditions, to the problem of shape-based object recognition. The results of a series of 

1 Implementing a multidimensional receptive field as a product of 2D receptive fields all of which look at the
same retina can result in “cross-talk” between different features if the spatial extent of the receptive fields is 
not limited. This does not seem to be a problem with Gaussian receptive fields, which respond very weakly to 
features that are far from the field’s center (cf. [19]). 
Figure 2: (a) A network representation of approximation by Generalized Radial Basis Functions.
(b) shows an equivalent interpretation of (a) for the case of Gaussian radial basis functions.
The solid circles in the image plane represent the 2D Gaussians associated with the first radial 
basis function, which corresponds to the first view of the object. The dotted circles represent 
the 2D receptive fields that synthesize the Gaussian radial function associated with another 
view. 


experiments that involved simple computer-generated shapes are described below. 

3.1 Input objects 

Objects for testing the recognition scheme were created using the Symbolics S-Geometry 3D 
graphics modeling system. The objects were 5-segment random wire frames 2 (Figure 3). All 
the objects were positioned in such a manner that their centers of mass coincided with the 
origin of the 3D coordinate system defined by the modeling program. Different views of the 
objects were obtained by rotating the S-Geometry “camera” around the 3D origin, so that it 
could assume any position specified by two viewpoint coordinates, $\theta$ and $\phi$, corresponding to
the latitude and the longitude on an imaginary sphere centered at the object. No rotation of 
the camera around its optical axis was allowed. 

3.2 Input representations 

We have experimented with several different methods of encoding object shape, all of which 
employed exclusively the 2D information available in the projection of the objects’ vertices 
onto the imaging plane. The first and most straightforward method was used in most of the 
experiments described in this section. 

2 In some of the experiments, 7-segment wires or other objects such as wire-frame cubes and octahedra were 
used. 
Figure 3: Two examples of wire objects used in the experiments. The wires were created by a
random walk in 3D. They were encoded for training and subsequent recognition by projecting
the vertices onto an imaging plane (under either orthographic or perspective projection). The
resulting vector of x, y-coordinates could be further preprocessed to obtain different encodings
(see section 3.2).

1. XY-coordinates. A list of the screen coordinates of the wire’s vertices, $(x_1, y_1, \ldots, x_n, y_n)$.
   The origin of the screen coordinate system was at the upper left corner of the screen, and
   the coordinates varied in the [0..127] range.

2. Centered XY-coordinates. Same as previous, but with the origin at the screen projection
   of the 3D center of rotation common to all the objects.

3. Segment lengths. Screen distances between the projections of the successive vertices of
   the objects.

4. Normalized segment lengths. Same as previous, but with the lengths divided by the
   length of the first segment.

5. Angles. Angles formed by the projections of the successive segments.

6. Angles + lengths. A mixed encoding, combining the angles and the segment lengths in
   one heterogeneous vector.

Note that the fifth encoding method (angles) leads to the invariance of recognition performance
with respect to translation, scaling and image-plane rotation of the objects. Another
point of interest is that nothing in the present approach precludes information other than 2D 
shape from being incorporated into the input representation. In particular, 3D shape cues 
(obtained, e.g., through binocular stereo) can be used within the same framework depicted in 
Figure 1. We shall return to this point in the discussion. 
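
The derived encodings listed above can be sketched as follows. The function names are hypothetical, and the use of signed turning angles (via atan2) is an assumption, since the text does not say whether the angles were signed. For a six-vertex wire this yields five segment lengths and four angles.

```python
import math

def segment_lengths(vertices):
    """Screen distances between successive projected vertices."""
    return [math.dist(p, q) for p, q in zip(vertices, vertices[1:])]

def normalized_segment_lengths(vertices):
    """Segment lengths divided by the length of the first segment."""
    lengths = segment_lengths(vertices)
    return [length / lengths[0] for length in lengths]

def angles(vertices):
    """Signed turning angles (radians) between successive projected segments."""
    result = []
    for p, q, r in zip(vertices, vertices[1:], vertices[2:]):
        ux, uy = q[0] - p[0], q[1] - p[1]
        vx, vy = r[0] - q[0], r[1] - q[1]
        # atan2(cross, dot) gives the angle from segment u to segment v.
        result.append(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))
    return result
```

Translating, uniformly scaling, or rotating the vertices in the image plane leaves the output of `angles` unchanged, which is the invariance noted in the text.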

3.3 Output representation 

As depicted in Figure 1, the recognition module was trained to produce a standard output for 
any input that showed a view of the target object. The output representation was identical 
to the input one (as a matter of fact, the first input view was chosen as the standard one). 
However, in addition to the standard view of the object whose arbitrary view was presented as
input, the system was also capable of recovering other information about that object, namely,
its attitude (as expressed by the viewpoint coordinates $\theta$ and $\phi$). 3

3.4 Test paradigm 

The primary measure of the system’s performance was the standard view recovery error, defined
as the euclidean distance between an actual output and the ideal one. Two statistical measures
of performance were computed in each of the experiments to be described below. These
measures involved training the system on each of 10 different wire objects and comparing the
standard view recovery errors for views of the trained object with those of the other nine objects.
The errors for the trained object should be small, compared to the errors for the other
objects (Figure 4). Ideally, the smallest error on a non-target object (call it MINnontarget)
should be larger than the largest error on the target (MAXtarget): a MIN/MAX ratio greater
than 1 is required for a perfect separation between the target and other objects using a simple
threshold decision. A less conservative measure is the ratio of the averages of the two error
classes, AVG/AVG. 4
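
The two indices can be computed directly from the lists of recovery errors. The sketch below uses hypothetical function names and assumes both lists are non-empty and that the target errors are not all zero.

```python
import math

def recovery_error(output, standard_view):
    """Euclidean distance between the actual output and the standard view."""
    return math.dist(output, standard_view)

def performance_indices(target_errors, nontarget_errors):
    """Return (MIN/MAX, AVG/AVG) for two lists of recovery errors."""
    min_max = min(nontarget_errors) / max(target_errors)
    avg_target = sum(target_errors) / len(target_errors)
    avg_nontarget = sum(nontarget_errors) / len(nontarget_errors)
    return min_max, avg_nontarget / avg_target
```

For example, target errors of 1, 2, 3 and non-target errors of 6, 8, 10 give MIN/MAX = 2 and AVG/AVG = 4, so any threshold between 3 and 6 separates the two classes.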

3.5 Example of operation 

Two examples of the module’s operation, one in which the input is the training object, and 
another in which it is a different but similar object, appear in Figure 5. The top row shows
the standard view of a wire frame object, superimposed on its estimate by the GRBF network 
(large black dots), when its input is a random view of the same object (second from top row). 
The fit is much closer than in the bottom two rows, where the input view belongs to a different 
object. 

From Figure 5 it appears that arbitrary views of the target object cause the GRBF module to 
output a vector that is close to the ideal (trained) one. It also appears that views of non-target 
objects are transformed into scaled versions of the ideal vector, so that $Y_{out} = k Y_{ideal}$,
where k < 1. To understand why that happens, it is convenient to consider first a linear
associative memory that is realized by a matrix operator C trained to recognize views of a
target object by transforming them into a preset standard vector Y. Since C maps distinct
vectors $V_i$ to the same vector Y, it must be singular (it can be shown that the rows of C are
all collinear). If the number of (randomly chosen) training views $V_i$ is sufficiently large, there
is a good chance that they span a 6-dimensional manifold that, to a first approximation, is a
hyperplane in $R^{2N}$ (see the appendix). Any new view V of the target object will lie within this
hyperplane and will be mapped to a scaled kY. Views of non-target objects will tend to be orthogonal to the space
spanned by the training views, resulting in $k \approx 0$. An analogous argument can be made for the
RBF scheme, in which the linear mapping C is preceded by the application of the radial basis 

3 We have also experimented with a scalar output representation; see section 3.6.8. 

4 Standard statistical methods of parameter estimation and hypothesis testing may be used to translate the
means and the standard deviations of MINnontarget, MAXtarget, AVGtarget and AVGnontarget into probabilities
of Type I and Type II recognition errors (see e.g. [21]). Since these methods involve table lookup
of probability distributions, we did not use them on-line. Characteristically for our experiments, a ratio of
AVGnontarget to AVGtarget of 5.0 sufficed to impose a 0.001 upper bound on the probabilities of both Type I
and Type II errors.
[Figure 4 label: error range for views of other objects]

Figure 4: Definitions of the AVG/AVG and the MIN/MAX performance criteria used throughout
the paper. The error here is defined as the euclidean distance between the standard view of
the target and the actual output of the system (the smaller the error, the greater the likelihood 
that the input view belongs to the target). In this illustration, the average error for non-targets 
is considerably greater than that of target views. Consequently, there is a good chance of 
correct recognition of the target (and correct rejection of non-targets). An ideal performance 
requires that there be no overlap between the error value ranges corresponding to target and 
non-target views, in which case MIN/MAX > 1 and the two classes of views are separable by 
thresholding. 
Figure 5: Examples of the module’s operation. Above: standard view of a wire frame object
(top row), superimposed on its estimate by the GRBF network (large dots), when its input is
a random view of the same object (second from top row). The fit is much closer than in the
bottom two rows, where the input view belongs to a different object. The number of training
views M = 40, the number of RBFs K = 20 and the range of attitudes $\theta, \phi$ is 0° to 90°.
A naive fixed-step gradient descent (with a small number of steps) was used to obtain the
optimal positions of the GRBF centers. Below: within a smaller range of $\theta, \phi \in [0^\circ, 45^\circ]$, the
performance was acceptable with only two radial basis units: M = 40, K = 2 (Note that in the
“different object” row the dots signifying the predicted vertex locations are in most cases off
the scale).
functions $G_\alpha$. The analogy is then between the original training vectors $V_i$ and their images
under $G_\alpha$.
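
The linear argument can be written out in a few lines. This is only a sketch under stated assumptions: C is taken to be a rank-one operator $C = Y a^T$ (consistent with its rows being collinear), and the vector $a$, introduced here for illustration, is assumed to lie in the span of the training views.

```latex
% Assumption: C = Y a^T, with a^T V_i = 1 for every training view V_i.
C V_i = Y (a^T V_i) = Y \quad \text{for all } i .

% A new view in the span of the training views, V = \sum_i \beta_i V_i:
C V = \sum_i \beta_i \, Y (a^T V_i) = \Big( \sum_i \beta_i \Big) Y = k Y,
\qquad k = \sum_i \beta_i .

% A view orthogonal to the span of the training views (a lies in that span):
a^T V = 0 \;\Rightarrow\; C V = 0, \quad \text{i.e. } k = 0 .
```

Views of non-target objects are only approximately orthogonal to the training views, which gives $k \approx 0$ rather than exactly zero.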

3.6 Performance 

3.6.1 Effects of receptive field size and of number of centers 

In the first experiment, the number K of RBF centers is made equal to the number M of 
training views by letting the centers coincide with the views themselves. Consider Figure 6, 
which shows the dependency of the error (distance between actual and ideal outputs) for random 
views of the trained object (left column) and the error for views of other objects (right column), 
as a function of K and of the size $\sigma$ of the (Gaussian) basis functions. Figure 6 conveys
information as to the relative significance of the average and worst-case performance of the
recognition module over the depicted range of K and $\sigma$. The worst-case performance (assessed
by comparing the upper curve in the left column with the lower curve in the right column) 
lags far behind the average performance (assessed by comparing the middle curves in the two 
columns). It should be noted that the role of the outliers that contribute to the worst-case 
measure is statistically insignificant, as long as the average performance does not drop below a 
certain threshold (corresponding to an AVG/AVG ratio of about 5). 

The next plot provides a direct answer to the question of the optimal combination of K and
$\sigma$. Under the AVG/AVG measure (Figure 7, middle column), it is $\sigma = 25$, for K = 100 = M
(clearly, increasing the number of training views and RBF centers improves the performance,
but the price in terms of computational resources makes it probably not worthwhile to increase
K and M beyond about 80 - 100). Under the MIN/MAX measure, the best performance is
achieved for $\sigma = 30$ (Figure 7, right column). The left column of Figure 7 gives a different
perspective on the module’s performance, by plotting the proportions of Type I and Type II
recognition errors vs. $\sigma$. Note that having too much interpolation (in this case, $\sigma > 25$) sharply
increases the probability of a Type II (false alarm or overgeneralization) error, as expected.

3.6.2 Effect of perspective projection 

The result of Ullman and Basri [22] on representing objects by linear combinations of views 
suggests that recognition posed as a problem in function approximation is better behaved under 
orthographic than under perspective projection. We have tested the GRBF module with two 
different settings of the distance of the simulated camera from the objects: “near”, in which 
there was an appreciable perspective distortion, and “far”, in which the distortion was almost 
unnoticeable (this served as an approximation of orthographic projection condition). From 
Figure 8 it can be seen that doubling the distance from “near” to “far” made no significant 
difference in the performance. 

A separate look at the false alarm and the miss rates (Figure 9) shows that if camera distance 
had any effect, it was on the miss rate. The most prominent effect was the decrease in the miss 
rate under orthographic approximation for M = K = 20. This finding is consistent with the 
Ullman-Basri theoretical argument for the relative ease of recognition under the assumption of 
orthographic projection. 
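
A small sketch of why the “far” setting approximates orthographic projection. The camera model is an assumption made for illustration: the camera sits on the z axis at the given distance from the origin, looks toward the origin, and the image is rescaled so that points in the plane z = 0 keep their coordinates. The distances 7 and 14 used in the test stand in for “near” and “far” with an object of unit width.

```python
import math

def project(points, distance=None):
    """Project 3D points (x, y, z) onto the image plane.

    distance=None gives orthographic projection. Otherwise the projection
    is perspective, with the camera at z = distance (assumed > max z).
    """
    if distance is None:
        return [(x, y) for x, y, z in points]
    return [(x * distance / (distance - z), y * distance / (distance - z))
            for x, y, z in points]

def distortion(points, distance):
    """Largest image-plane shift of any point relative to orthographic projection."""
    ortho = project(points)
    persp = project(points, distance)
    return max(math.dist(p, q) for p, q in zip(ortho, persp))
```

Under this model, doubling the camera distance roughly halves the perspective distortion, and the distortion vanishes as the distance grows.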
Figure 6: Error (distance between actual and ideal output) vs. the size $\sigma$ of the basis functions,
for modules with different number of centers K (the number of training views M is equal here 
to K). Data are shown for two input sets: random views of the trained object (ERV, left 
column) and views of other nine objects (EOO, right column). Three measures of the error, 
MIN (lower curves), AVG (middle curves) and MAX (upper curves) are shown separately. Bars 
indicate standard deviation, computed over ten different training objects. 
Figure 7: Left column: Type I or miss (MISS; lower curve) and Type II or false alarm (FA;
upper curve) recognition error rates, vs. $\sigma$, by the number of centers K. Middle column:
AVG/AVG performance index. Right column: MIN/MAX performance index (see section 3.4
and Figure 4) vs. $\sigma$, by K.
Figure 8: Left column: AVG/AVG performance index vs. the distance of the objects from the
simulated camera, by the number of training views M (here K = M, $\sigma = 30.0$). The “near”
distance is about seven times the apparent width of the wire objects used for this experiment.
Right column: MIN/MAX performance index vs. the distance of the objects from the simulated
camera, by the number of training views M (here K = M, $\sigma = 30.0$).
Figure 9: False alarm (FA, left column) and miss (MISS, right column) rates vs. the distance
of the objects from the simulated camera, by the number of training views M (here K = M,
$\sigma = 30.0$).
Figure 10: AVG/AVG and MIN/MAX performance vs. the range of the viewpoint coordinates
$\theta, \phi$ (the objects are a cube and an octahedron, M = K = 40, $\sigma = 30.0$, and the error bars are
standard deviations over 10 sets of random training and testing views). Here and in the next
figure, $\phi_{max} = 2\theta_{max}$, so that $\theta_{max} = 180^\circ$ corresponds to the full viewing sphere.

3.6.3 Effect of range of attitudes 

If the number of training views is held constant, the performance of the GRBF module is 
expected to deteriorate with the increase in the range of the viewpoint coordinates into which 
the training views fall. Figure 10 shows that this indeed happens: for M = K = 40 and $\sigma = 30$,
both the AVG/AVG and the MIN/MAX measures take a sharp dip when $\theta, \phi$ reach (120°, 240°).

3.6.4 Recovery of attitude 

The range of the allowed orientations has a similar influence on the precision of the recovery 
of the orientation parameters $\theta, \phi$ (Figure 11). For M = K = 40 and $\sigma = 30$, the mean square
error of the recovered orientation stays below 10° for $\theta < 120^\circ$, $\phi < 240^\circ$, rising to about 60°
for the full range of orientations. Doubling M and K extends effective recovery of $\theta, \phi$ to the
full range of orientations.

3.6.5 Effect of number of vertices 

The power of the GRBF module to discriminate between trained object and other, similar 
objects increases with the increase in the number of vertices used in the encoding (Figure 12). 
The discrimination power is nil (MIN/MAX = 0) for two-vertex objects, rises steadily with
the number of vertices, then starts to drop. This may be due to an interplay of two factors:
the amount of information and cross-talk among GRBF centers. At least four points on each
object are necessary for discrimination (see the appendix). The more vertices are used, the more 
Figure 11: Errors in the viewpoint coordinates $\theta, \phi$ recovered by the module vs. the range of
the viewpoint coordinates (M = K = 40, $\sigma = 30$).


information there is for the recognizer to go by, until cross-talk sets in (which will happen if the 
size of the basis functions $\sigma$ is not allowed to decrease in proportion with the increased density
of object vertices in the image plane). In human recognition, a similar effect is intuitively 
expected (30-vertex wire objects seem to be too complicated to be distinguished by vertex 
positions alone). 

3.6.6 Different input/output representations 

The versatility of the present approach to recognition is illustrated in Figure 13, which shows 
superimposed a plot of the MIN/MAX performance vs. the number K of RBF centers for 
the regular encoding used throughout the paper (x, y-coordinate vectors) and a shift, scale
and image-plane rotation invariant encoding (angles between successive segments of the wire
objects). For a six-vertex object, the x, y-coordinate vector has length 12, while the angles
vector has length 4. The relatively smaller amount of information in the angle encoding puts 
it at a disadvantage for smaller K's. For a large enough K the angle encoding yields higher 
MIN/MAX ratio, in addition to possessing desirable invariance with respect to shift, scale and 
rotation of the input. 

3.6.7 Sensitivity to occlusion 

To find out the sensitivity of the GRBF scheme to occlusion, we have repeatedly trained it on 
views each of whose constituent features had a fixed probability of being “occluded” (in which 
case the corresponding component of the representation vector was set to 0). Note that more 
than one feature could be occluded at a time. 

The performance of the GRBF module in subsequent testing, plotted vs. the probability of 
individual vertex occlusion, is shown in Figure 14. 5 It appears from the figure that decent performance
can be expected even when the probability of having any particular feature occluded
is 0.2, in which case about three quarters of the training views had at least one of the features 

5 Although no occlusion was assumed in testing, one can get an idea of the scheme’s sensitivity to this factor
by considering Figure 12, which shows the effect of the number of features on the discrimination power. 
Figure 12: AVG/AVG and MIN/MAX indices vs. the number of vertices used in training (the
data for number of vertices from 2 to 6 are for six-vertex random wire objects; the data for
number of vertices 7 and 8 are for eight-vertex wires; M = K = 60, $\sigma = 30.0$).



Figure 13: MIN/MAX performance for two types of input encoding: vertex coordinates (solid 
line) and angles formed by successive pairs of segments (dashed line; data for six-vertex random 
wire objects, $\sigma$ chosen optimal for each encoding, M = K).
Figure 14: AVG/AVG and MIN/MAX indices vs. the probability of any given vertex being
occluded (left: six-vertex random wire objects; right: eight-vertex objects). $\theta, \phi$ were confined
to one half of the viewing sphere; K = M = 50 (lower curves), K = M = 100 (upper curves);
$\sigma = 30.0$.

occluded. Occlusion has had a somewhat stronger effect on the learning of eight-vertex wires 
(Figure 14, right column). 
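
The “three quarters” figure follows from independent occlusion: with per-feature probability p and n features, the chance that a view has at least one occluded feature is 1 - (1 - p)^n, which is about 0.74 for p = 0.2 and n = 6 (and about 0.83 for n = 8). The sketch below shows this together with one assumed way of simulating the occlusion on a flat x, y-coordinate vector; the names are hypothetical, and zeroing both coordinates of a vertex is an assumption about what “the corresponding component” means for this encoding.

```python
import random

def occlude(view, p, rng):
    """Zero each feature (an x, y pair in a flat vector) with probability p."""
    out = list(view)
    for i in range(0, len(out), 2):
        if rng.random() < p:
            out[i] = out[i + 1] = 0.0
    return out

def prob_any_occluded(n_features, p):
    """Probability that at least one of n independent features is occluded."""
    return 1.0 - (1.0 - p) ** n_features
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make the simulated training set reproducible.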

Note that in the present experiment the basic GRBF scheme was not augmented by any 
mechanism specifically designed to deal with occlusion. A better insensitivity to the deletion
(occlusion) of features can be achieved by providing a basis function (center) for each possible 
subset of features. We conjecture that in practice the maximum size of necessary feature subsets 
is rather small. This size could be found during learning, by analyzing the weight matrix W. 

3.6.8 Scalar vs. vector output 

If a compact output representation is required, it is possible to train the recognition module to 
produce a scalar output, as opposed to a vector that represents a standard view. Figure 15
shows that the single-output network performs on the average almost as well as the network of 
Figure 2 (which outputs a standard view vector). The advantage of the vector-output module 
may be explained by the larger number of its free parameters (elements of the C matrix). 

3.6.9 GRBF, using gradient descent 

In most of the experiments described in this report, the GRBF module was trained without 
searching for optimal center locations $t_\alpha$, coefficients C or weights W (equation 3). In these
cases, the centers were set at some of the training views, the matrix C was found by a generalized 
inverse method (see section 2), and an identity matrix was used as W. 
Figure 15: AVG/AVG and MIN/MAX performance for scalar and vector output (data for
six-vertex random wire objects, $\sigma = 30$, M = K = 80; error bars show standard deviation
computed over 10 objects). In the first case, the network is trained to output 1 when shown
views of the target. In the second case, the output is a standard view of the target.


The parameters $t_\alpha$ and C obtained in this manner serve as a convenient starting point for
improvement using gradient descent search in the parameter space. The gradient descent was
performed according to the expressions given in [14]. 6 We have compared the performance
improvement for this encoding under three conditions: changing centers $t_\alpha$, or weights W, or
both (the coefficient matrix C was always allowed to change), for two sets of parameter values.
Only trials for which the gradient descent procedure actually converged were included in the
comparison. The results for M = 40, K = 10 and a full range of viewpoints appear in Figure 16.
Note that the best effects were achieved by a combined adjustment of C, $t_\alpha$ and W together.
A visual example of the performance of the GRBF module with K = 10 centers and M = 40 
training views after the adjustment of the centers’ locations through gradient descent appears 
in Figure 5. 

3.7 Comparison with related schemes 

At this point it is natural to ask whether other, simpler network schemes can perform in the 
recognition task defined in this report as well as does the GRBF module. To address this 
question, we investigated the performance of three related schemes: linear associative memory 
and two versions of the nearest neighbor classifier (with and without feature correspondence). 

6 Since these expressions pertain to the case of a single-output network, we used such a network in this 
experiment. 


Figure 17: AVG/AVG and MIN/MAX indices vs. the number of training views for the linear 
associative scheme (six-vertex random wire objects). 




Figure 18: AVG/AVG and MIN/MAX indices vs. the number of remembered views for the 
nearest-neighbor method that uses correspondence information (six-vertex random wire objects). 
For comparison purposes, the performance of the RBF scheme with M = K is also 
shown (dashed curve). 

Figure 19: AVG/AVG and MIN/MAX indices vs. the width σ of the Gaussian blurring mask 
(see section 3.7.3) for the nearest-neighbor method that uses 2D correlation and 2D array 
representation of views, instead of correspondence information and 1D vector representation of 
views (six-vertex random wire objects; the number of remembered views is 80). 

This requirement can be dispensed with, at the cost of reduced performance, as follows. Define 
recognition error for a given object as the inverse of the sum of 2D correlations between each 
of the stored training views (represented in this case as 2D arrays rather than as 1D vectors of 
vertex coordinates) and the input view. Low error would then be obtained for an input that 
is “close” to at least one of the stored views. To improve the generalization ability of the NN 
classifier that relies on 2D correlation, the input view is blurred (convolved with a Gaussian 
mask) before the correlations are computed. The dependence of the performance on the size of 
the blurring mask is shown in Figure 19. 
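
The correlation-based error measure can be sketched as follows. This is an illustrative reading, not the original implementation: "2D correlation" is taken here to mean the zero-shift correlation (the sum of the pixel-wise product), the Gaussian mask is truncated at three standard deviations, and the function names, the small constant that guards against division by zero, and the toy point images are assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian mask, truncated at 3 sigma and normalized to unit sum."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable Gaussian blur: convolve every row, then every column."""
    k = gaussian_kernel(sigma)
    if len(k) > min(image.shape):
        raise ValueError("blurring mask is larger than the image")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def recognition_error(input_view, stored_views, sigma):
    """Inverse of the summed correlations between the blurred input and the stored views."""
    blurred = blur(input_view, sigma)
    total = sum(float((blurred * stored).sum()) for stored in stored_views)
    return 1.0 / (total + 1e-12)   # small constant avoids division by zero

def render(points, size=32):
    """Toy 2D array view: one bright pixel per projected vertex."""
    image = np.zeros((size, size))
    for row, col in points:
        image[row, col] = 1.0
    return image

target_points = [(5, 5), (10, 20), (20, 8), (25, 25)]
stored = [render(target_points)]
shifted = render([(r + 1, c + 1) for r, c in target_points])   # slightly displaced target
other = render([(2, 28), (15, 15), (28, 3), (12, 2)])          # unrelated pattern

err_shifted = recognition_error(shifted, stored, sigma=2.0)
err_other = recognition_error(other, stored, sigma=2.0)
err_unblurred = recognition_error(shifted, stored, sigma=0.1)
```

In this toy setting the displaced target yields a much lower error than the unrelated pattern, and blurring is what makes the displaced view overlap with the stored one at all.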

4 Discussion 

The reconstructionist dogma of computational vision appears recently to have fallen upon hard 
times. A standard version of this dogma holds that the ultimate goal of a visual recognition 
system is the formation of object representations that make explicit the relevant 3D structure, 
just as a toy airplane makes explicit the relative size and position of the wings and the fuselage 
in the real airplane [24]. This view of recognition considers the 2D image bottleneck that 
necessarily intervenes between the distal object and its percept a nuisance, to be overcome, e.g., 
by invoking relevant physical and computational constraints [1]. Due to persistent difficulties at 
the higher levels of the reconstructionist program (see [25, 26, 27] for reviews), inverse optics all 
the way to the top [1, 28] no longer seems to be the most promising approach to recognition. 8 

The performance of the GRBF module described in the foregoing sections suggests that object 
recognition can be done without first reconstructing the third dimension of the visual input, 
and without relying on three-dimensional object models (see also [29, 22, 19]). Furthermore, 
adopting the present approach to recognition does not mean giving up the use of information 
beyond 2D shape (color, texture and depth). Computationally, therefore, there seems to be 
no reason to reject the memory-based function approximation approach to recognition out of 
hand. 

In the study of biological vision, the notion that in the primate visual system objects are 
represented by single units each of which responds selectively to a specific object, dubbed the 
grandmother cell dogma, used to draw criticism, for a number of reasons. The arguments given 
against it included the limited memory capacity of the brain and the lack of neurobiological 
and psychological support. The results reported in the previous sections indicate that doing 
function approximation rather than straightforward template matching may solve the memory 
capacity problem. Furthermore, the function approximation approach is also compatible with 
prominent biological and psychophysical findings on recognition, outlined below. 

4.1 Biological aspects 

4.1.1 Receptive fields 

One feature of the GRBF scheme that may guide its biological interpretation is the expressibility 
of its function in terms of combinations of receptive fields. It is possible to decompose a 
multidimensional Gaussian radial basis function into a product of Gaussians of lower dimensions 
(Figure 2b). In our case, the center of a basis unit plays a role similar to a prototype and the 
unit’s response profile is synthesized as the product of feature detectors with two-dimensional 
Gaussian receptive fields (i.e., the activity of a detector depends on the distance r between the 
stimulus and the center of the receptive field as $e^{-r^2/\sigma^2}$). The network’s output (see equation 
1) is the sum of these products and therefore represents the logical disjunction of conjunctions 
“$\bigvee_\alpha \bigwedge_i$ (feature $F_i$ at $(x_i, y_i)$)”, where the disjunction ranges over all the prototypes of the 
given object. 
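
The factorization mentioned above is easy to check numerically. The sketch below assumes a Gaussian of the form `exp(-r^2 / sigma^2)` and a view made of six features, each with an (x, y) image position; the function names and the random data are illustrative only.

```python
import numpy as np

def unit_response(view, prototype, sigma):
    """Product of two-dimensional Gaussian receptive fields, one per feature."""
    r2 = ((view - prototype) ** 2).sum(axis=1)     # squared distance of each feature
    return np.exp(-r2 / sigma ** 2).prod()

def full_gaussian(view, prototype, sigma):
    """The same unit written as a single 2n-dimensional radial Gaussian."""
    return np.exp(-((view - prototype) ** 2).sum() / sigma ** 2)

rng = np.random.default_rng(0)
prototype = rng.uniform(0.0, 10.0, size=(6, 2))            # six features, (x, y) each
view = prototype + rng.normal(scale=0.5, size=(6, 2))      # a nearby view

product_form = unit_response(view, prototype, sigma=3.0)
radial_form = full_gaussian(view, prototype, sigma=3.0)
```

The two values agree because the exponent of the radial Gaussian is a sum over features, so the exponential splits into a product of per-feature terms.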

4.1.2 View-specific units 

Cells that respond preferentially not only to a specific object, but to a limited range of that 
object’s views, have been found in the inferotemporal cortex of monkeys by a number of researchers 
(see [30] for a review). The existence of these “grandmother cells” is compatible with 
the notion of a hierarchical structure of object representations. The lower level of this structure 
may be composed of receptive fields that transduce position of individual features into activity 
of units that encode their presence. The next level would correspond to “grandmother” units 
that encode specific views. In the GRBF terminology, these are the basis units, each centered 
around the view it is tuned to. At a still higher level, a “disembodied” representation of an 

8 Inverse methods appear to be useful in low-level visual tasks such as stereo and motion computation which 
contribute to the representation that Marr called 2½D-sketch [16]. At the higher levels, the lack of well-defined 
constraints on the solution that are general enough to be relevant in real-life situations hinders the application 
of inverse methods. 
object could be formed by combining several view-specific units, arriving at the disjunction of 
conjunctions representation that stands for the object, irrespective of viewpoint (or position, 
or size). 9 

4.1.3 Separating “what” from “where” 

Rather than discarding the viewpoint information in the process of arriving at the viewpoint- 
invariant representation, the GRBF scheme can retrieve and output it separately (see Figure 1 
and section 3.6.4). As a final parallel between GRBFs and visual neuroscience, we note that 
this separation of form and space resembles the separation between the ventral and the dorsal 
visual pathways, the first of which carries predominantly shape and the second - predominantly 
spatial information from the striate cortex towards temporal and parietal regions, respectively 
(see e.g. [31]). 

4.2 Psychophysical aspects 
4.2.1 Human object recognition 

Another perspective on the biological plausibility of our approach to recognition is provided by 
psychological studies. Different features of human performance in object recognition can be 
interpreted in terms of characteristics of the underlying information-processing mechanism. 
We first mention briefly several of the most prominent relevant findings and phenomena. 

Object constancy 

Perhaps the most familiar of these is the phenomenon of object constancy: our ability to 
recognize things under widely changing conditions from a variety of viewpoints, and the lack of 
change in the apparent shape of objects under these different conditions. This phenomenon is 
usually illustrated with a simple object such as a coin, which can be easily recognized when seen 
at an oblique angle, and whose outline then appears to us as a circle tilted in depth rather than 
an image-plane ellipse. Importantly, object constancy works for considerably more complex 
things such as faces and letters. This has prompted some researchers [32] to postulate a shape 
normalization mechanism in object perception, whose function is to bring the viewed shape to 
a standard appearance before recognition is attempted. 

Canonical views 

The existence of standard or canonical views of objects predicted by the shape normalization 
theory is supported by a wide range of experimental data [33]. Canonical views of commonplace 
objects can be reliably characterized using several criteria. For example, when asked to form 
a mental image of an object, people usually imagine it as seen from a canonical perspective. 
In recognition, canonical views are identified more quickly than others, with response times 
decreasing monotonically with increasing subjective goodness. 

9 Marr ([1], p.15) argued that little understanding of how vision is done is gained by invoking the grandmother 
cell hypothesis if it is based only on neurophysiological data. Our approach complements the neurophysiological 
hypothesis by providing one possible computational account of the hierarchical structure of object representations, 
from feature detectors, through view-specific encoding, to grandmother cells. 
Mental rotation 

The monotonic increase in the recognition latency with misorientation of the object relative 
to a canonical view (as defined independently, e.g. through subjective judgement) prompts the 
interpretation of the recognition process in terms of a mechanism related to mental rotation 
[34]. Specifically, it seems that the recognition process may be decomposed into two stages, 
normalization and comparison [5]. In the first stage, the system carries out the transformation 
necessary to normalize the appearance of the input object. In the second stage, a comparison is 
made between the normalized input and a model stored in memory. Close agreement between 
the two then leads to the recognition of the input as an instance of the model (the two stages 
are presumably executed in parallel for a number of candidate object models). Practice with 
specific objects appears to cause the two-stage strategy to be abandoned in favor of a more 
memory-intensive, less time-consuming direct comparison strategy. Under direct comparison, 
many views of the objects are stored and recognition proceeds in essentially constant time, 
provided that the presented views are sufficiently close to one of the stored views [34, 35]. 

Invariant features and integration 

Another possible way around the need for time-consuming mental rotation in recognition is 
through the use of viewpoint-invariant features. When the object shapes include potentially 
informative viewpoint-invariant features, and when the experimental setup encourages the use 
of such features, they apparently lead to the disappearance of sequential effects in recognition, 
even for objects that normally do exhibit such effects [36]. The human visual system also 
appears to be able to put to use in recognition cues other than shape, such as color and texture, 
when these are available. 10 

4.2.2 GRBF and the psychology of recognition 

Although the GRBF-based recognition system can hardly be considered a complete model of 
human object recognition, some of its functional characteristics agree with the features of human 
performance outlined above. In particular, recognition by the recovery of a fixed standard 
view of the input object may be considered analogous to the phenomenon of object constancy. 
Furthermore, as an interpolation scheme, a GRBF module necessarily performs better on some 
of the views of the object it has been trained upon (specifically, on the views corresponding to 
the centers of the basis functions) than on other, random views. This characteristic resembles 
the phenomenon of canonical views. Finally, as we have already mentioned above, a GRBF- 
based recognizer can accept inputs from diverse sources of visual information (as well as from 
non-visual sensors). 

At least two of the features inherent in the present formulation of the interpolation-based 
approach to recognition mar its plausibility as a functional model of human object recognition. 
The first of these, the reliance on a supervised learning procedure, can in principle be dispensed 
with by modifying the scheme to incorporate adaptive data-driven clustering, and to associate a 
constant output vector with all the inputs that fall within the same cluster (rather than relying 
on externally supplied input-output pairs, as is done at present). The second shortcoming of 
the present formulation lies in its disregard of the dynamics of object recognition. In particular, 

10 In addition, cross-modality cues (auditory and haptic) are readily incorporated. 
the GRBF approach ignores the time course of the recognition process and its modification with 
practice (as manifested in the shift from time-intensive to memory-intensive strategy apparent 
in human performance). 

4.2.3 Comparison with CLF 

A model of object recognition that attempts to address these issues and appears to be related 
to GRBF is the CLF (conjunction of localized features) of [37, 19]. In this model, formulated 
as a two-layer network, units in the second (representation) layer come to represent patterns in 
the first (input) layer through an unsupervised Hebbian reinforcement mechanism. Sequences 
of second-layer units correspond to multiple-view representations of input objects, with the 
association between successive views predicated on the existence of apparent motion between 
the views during training (that is, the two successive views must resemble each other and 
must be sufficiently close in time to be able to elicit the perception of apparent motion in a 
human observer). Thus, on one hand, the CLF model represents an object by a disjunction of 
conjunctions of the presence of features in specific (fuzzy or blurred) locations in the image, 
just as the GRBF module does. On the other hand, the CLF model is able to replicate the 
dynamic behavior of the human object recognition system, mentioned above, through non- 
uniform activation of sequences of representation units and their modification with practice. 

4.2.4 Two predictions 

Assuming that a scheme resembling GRBF or another kind of prototype interpolation is the basis 
of the human ability to recognize objects allows one to formulate strong predictions regarding 
human performance in specific experiments. The most important of these predictions states 
that the ability of the visual system to generalize recognition to a novel view of an object should 
drop off significantly with the misorientation of the novel view relative to the familiar views of 
that object. Furthermore, the drop-off rate should be independent of the relative configuration 
of the familiar views in the viewpoint coordinate space. 

Other contemporary models of object recognition generate different predictions when brought 
to bear on these points. Ullman’s alignment model [5] (as well as the related models of Lowe 
[6] and Thompson & Mundy [4], Biederman’s RBC theory [38] and most computer vision works 
on recognition) predicts no first-order dependency of recognition rate on misorientation. The 
linear scheme of Ullman and Basri [22] also predicts no such dependency, except when its three 
training views are coplanar in the viewpoint coordinate space. The predictions of the CLF 
model of Edelman [37, 19], on the other hand, agree with those of the GRBF scheme (which is 
not surprising, since the two appear to be related). 

Experimental findings seem to support the limited generalization view of recognition shared 
by the GRBF and the CLF models. Descriptions of some relevant experiments can be found 
in [39, 40, 41, 34, 42, 35, 43]. Work that should further elucidate this point is currently under 
way in our laboratory. 

5 Summary 

We have described experiments with a versatile pictorial prototype-based learning scheme for 
3D object recognition. The GRBF scheme seems to be amenable to realization in biophysical 
hardware because the only kind of computation it involves can be effectively carried out by 
combining receptive fields. Furthermore, the scheme is computationally attractive because it 
brings together the old notion of a “grandmother” cell and the rigorous approximation methods 
of regularization and splines. 

Acknowledgement 

We thank Federico Girosi and Shimon Ullman for useful comments, and Daphna Weinshall for 
many discussions and for helping us with understanding [29]. 

6 Appendix: The result of Ullman and Basri for orthographic 
projection 

Ullman and Basri [29, 22] have recently discovered the striking fact that under orthographic 
projection a view of a 3D object is the linear combination of a small number of views of the 
same object. In this appendix, we reformulate their results in the more abstract setting of linear 
algebra. This framework makes the result very transparent: the constraint of linear transformation 
(the same linear transformation for each vertex) implies immediately that the set of 
views of an object spans a 9-dimensional space, independently of the number of vertices; orthographic 
projection preserves linearity while reducing the number of dimensions to 6. Simple 
considerations show that the linear spaces of the x and y coordinates are nonintersecting and 
that each has dimension 3. This appendix describes the previous statements in more detail. 

6.1 Any view of a 3D object is a linear combination of a small, fixed number 
of views 

This section provides the main result (in the second subsection). 

6.1.1 Any 3D-view of an object is a linear combination of 9 views 

Let us define a 3D-view of a 3D object as: 

$$X = (x_1, y_1, z_1, x_2, y_2, z_2, \dots, x_n, y_n, z_n)^T$$

with $X \in \mathcal{R}^{3n}$, which is a vector space in the usual way. 
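Consider the set of uniform (our definition) linear operators on $\mathcal{R}^{3n}$ defined by the $3n \times 3n$ 
matrices $L_{3n}$: 

$$L_{3n} = \begin{pmatrix} L & 0 & \cdots & 0 \\ 0 & L & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & L \end{pmatrix}$$

where 

$$L = \begin{pmatrix} l_{11} & l_{12} & l_{13} \\ l_{21} & l_{22} & l_{23} \\ l_{31} & l_{32} & l_{33} \end{pmatrix}$$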
later). 

The space of the $L_{3n}$ operators is a vector space which is isomorphic to the vector space of 
the $L$ matrices. It therefore has a basis of 9 elements independently of $n$. We can express 

$$L_{3n} = \sum_{i=1}^{9} a_i L^i_{3n}$$

where the $a_i$ are the $l_{ij}$ and $L^i_{3n}$ is the usual basis for $L_{3n}$, and thus, for a fixed reference view $X^{obj}_0$ of the object, 

$$X^{obj} = L_{3n} X^{obj}_0 = \sum_{i=1}^{9} a_i L^i_{3n} X^{obj}_0 = \sum_{i=1}^{9} a_i X^{obj}_i$$

where the $X^{obj}_i$ are 9 independent 3D views of the specific object, needed to span the 9 elements of 
$L$, 3 for each coordinate. Thus: 


Theorem 6.1 The vector space $V^{obj}_{3D}$ generated by uniform linear transformations on a 3D view 
of a specific object is a 9-dimensional subspace of $\mathcal{R}^{3n}$ (3 dimensions each for x, y and z). 

Thus any object $obj_i$ generates a corresponding low dimensional subspace $V^{obj_i}_{3D}$ of the space of all possible 
views of all objects ($\mathcal{R}^{3n}$). Of course, $V^{obj}_{3D} \neq \mathcal{R}^{3n}$ iff $n > 3$. In other words, to have object 
specificity, i.e., for this result to be nontrivial, it is necessary that $n > 3$. Notice that in a sense 

$$\mathcal{R}^{3n} = V^{obj_1}_{3D} + V^{obj_2}_{3D} + \dots$$


6.1.2 Any 2D-view of a 3D object is a linear combination of 6 2D-views 

Now consider the orthographic projection $P : \mathcal{R}^{3n} \to \mathcal{R}^{2n}$, defined by $PX = x$, that is 

$$x = (x_1, y_1, x_2, y_2, \dots, x_n, y_n)^T$$

with $P$ being a linear operator with the matrix representation (block-diagonal, with $n$ copies of the same $2 \times 3$ block) 

$$P = \begin{pmatrix} p & & \\ & \ddots & \\ & & p \end{pmatrix}, \qquad p = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

We define $x$ as the 2D-view of a 3D object. The result below follows immediately (6 views span 
the elements of $L$ in the first 2 rows) and is the main result of Ullman and Basri (in a different 
formulation): 


Theorem 6.2 The vector space $V^{obj}_{2D}$ given by $V^{obj}_{2D} = P V^{obj}_{3D}$ is a 6-dimensional subspace of $\mathcal{R}^{2n}$ 
(the space of all 2D orthographic views of all 3D objects), i.e. $x^{obj} = \sum_{i=1}^{6} a_i x^{obj}_i$. 
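
Theorems 6.1 and 6.2 can be checked numerically with a short sketch. It assumes a random eight-vertex object and random (generic) uniform linear transformations; the variable and function names are illustrative and not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8                                    # number of vertices (n > 3)
obj = rng.normal(size=(n, 3))            # row k holds (x_k, y_k, z_k)

def view_3d(L):
    """Apply the same 3x3 matrix L to every vertex; return (x1, y1, z1, ..., xn, yn, zn)."""
    return (obj @ L.T).reshape(-1)

def view_2d(L):
    """Orthographic projection of the transformed object: drop every z component."""
    return (obj @ L.T)[:, :2].reshape(-1)

transforms = [rng.normal(size=(3, 3)) for _ in range(50)]
rank_3d = np.linalg.matrix_rank(np.stack([view_3d(L) for L in transforms]))
rank_2d = np.linalg.matrix_rank(np.stack([view_2d(L) for L in transforms]))

# A new 2D view written as a linear combination of six stored 2D views.
basis = np.stack([view_2d(L) for L in transforms[:6]], axis=1)    # shape (2n, 6)
new_view = view_2d(rng.normal(size=(3, 3)))
coeffs, *_ = np.linalg.lstsq(basis, new_view, rcond=None)
reconstruction = basis @ coeffs
```

Fifty random views span only 9 dimensions in 3D and 6 dimensions after projection, regardless of n, and the least-squares reconstruction of a new view from six stored views is exact up to rounding.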

The inclusion of rigid translations is equivalent to the addition of a two-dimensional linear 
subspace (the same for all objects), spanned by the vectors 

$$(1, 0, 1, 0, \dots, 1, 0)^T \quad \text{and} \quad (0, 1, 0, 1, \dots, 0, 1)^T$$

6.2 The x and the y coordinates of a view are each a separate linear combination 
of 3 views 

In the previous section we have seen that any 2D-view of a 3D object under orthographic projection 
is the linear combination of 6 2D-views. This section reformulates another observation of 
Ullman and Basri: the x coordinates of a 2D-view are a linear combination of the x coordinates 
of 3 2D-views and the y coordinates are an independent linear combination of the y coordinates 
of 3 2D-views. 

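Let us consider a similarity transformation of $X$ that groups the coordinates by axis: 

$$X' = (x_1, x_2, \dots, x_n, y_1, y_2, \dots, y_n, z_1, z_2, \dots, z_n)^T$$

Under this similarity transformation, $L_{3n}$ becomes a $3 \times 3$ matrix of 9 (that is $3 \times 3$) blocks. 
Each block is a multiple of $I \in \mathcal{R}^{n,n}$ (notice the “isomorphism” to $L$): 

$$L'_{3n} = \begin{pmatrix} L_{11} & L_{12} & L_{13} \\ L_{21} & L_{22} & L_{23} \\ L_{31} & L_{32} & L_{33} \end{pmatrix}$$

where 

$$L_{11} = l_{11} I$$

and so on for the other blocks. 

The same argument of the previous section makes it clear that if we define 

$$\xi = (x_1, x_2, \dots, x_n)^T, \qquad \eta = (y_1, y_2, \dots, y_n)^T$$

then the following holds: 

$$\xi = \sum_{i=1}^{3} l_{1i} \xi_i$$

$$\eta = \sum_{i=1}^{3} l_{2i} \eta_i.$$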

Thus we have proved: 

Theorem 6.3 The subspace spanned by the vectors $\xi$ - the x components of $x^{obj}$ - which is an 
n-dimensional subspace of $\mathcal{R}^{2n}$ (which is 2n-dimensional), is spanned by three views of the x 
coordinates of the object undergoing uniform transformations, i.e., each $\xi$ can be represented as 
the linear combination of 3 independent $\xi_i$. The same is true for the $\eta$: each $\eta$ is an independent 
linear combination of 3 $\eta_i$. Again, $n > 3$ in order for this to be non-trivial (since the space of the $\xi$ is $\mathcal{R}^n$ for 
$n \le 3$). 

Remark: The basis of $\xi$ and the basis of $\eta$ depend on the specific object. 

6.3 $V_x$ and $V_y$ have the same basis, i.e., 1.5 snapshots suffice 

We know from the previous sections that $V^{obj}_{2D} = V_x \oplus V_y$, where $\dim V_x = \dim V_y = 3$. A 
stronger property holds: 

Theorem 6.4 $V_x = V_y$ 

Proof. Assume that $V_x$ and $V_y$ are not identical: then there is a vector $y$ which is in $V_y$ and 
not in $V_x$ (or vice versa). Then one can take the 3D view that originated $y$ (through orthogonal 
projection) and apply to it a legal transformation consisting of a rigid rotation of 90 degrees in 
the image plane (such a transformation is in $L$ and therefore is legal). The x view of that 3D 
vector is $y$, contradicting the assumption. It follows that $V_x = V_y$. 

Remarks: 

1. The same argument shows that $V_x = V_y = V_z$. 

2. The same basis of three vectors spans $V_x$ and $V_y$ (separately). 



3. The property that the x views and the y views of the same 3D object from the same 
snapshot are independent is generic, since if they were dependent, a very slightly different 
object, differing only in the y coordinate of one vertex would have independent views 
(observation due to Bruno Caprile). 

4. In general, 1.5 snapshots are sufficient. 
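
Theorem 6.4 and the "1.5 snapshots" remark can be illustrated with a small numerical sketch. It assumes a random eight-vertex object and generic uniform linear transformations; all names are illustrative. The x and y coordinate vectors of one view plus the x coordinate vector of a second view are used as a common basis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                     # number of vertices (n > 3)
obj = rng.normal(size=(n, 3))             # row k holds (x_k, y_k, z_k)

def xi_eta(L):
    """Return the x-coordinate vector and the y-coordinate vector of one 2D view."""
    view = obj @ L.T
    return view[:, 0], view[:, 1]

pairs = [xi_eta(rng.normal(size=(3, 3))) for _ in range(20)]
xis = np.stack([p[0] for p in pairs])
etas = np.stack([p[1] for p in pairs])

rank_x = np.linalg.matrix_rank(xis)                          # dimension of V_x
rank_y = np.linalg.matrix_rank(etas)                         # dimension of V_y
rank_joint = np.linalg.matrix_rank(np.vstack([xis, etas]))   # stays 3 if V_x = V_y

# "1.5 snapshots": x and y vectors of view 1 plus the x vector of view 2.
(xi1, eta1), (xi2, _) = pairs[0], pairs[1]
basis = np.stack([xi1, eta1, xi2], axis=1)                   # shape (n, 3)
xi_new, eta_new = xi_eta(rng.normal(size=(3, 3)))
cx, *_ = np.linalg.lstsq(basis, xi_new, rcond=None)
cy, *_ = np.linalg.lstsq(basis, eta_new, rcond=None)
```

Both coordinate vectors of a new view are reproduced exactly (up to rounding) from the same three basis vectors, each with its own set of coefficients.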

6.4 The case of rigid transformations, i.e., rotations in 3D 

The previous two sections have considered the case of uniform linear transformations in 3D of 
a 3D object. The space of such transformations is a vector space that contains as a nonlinear 
subspace the space of the rigid rotations in 3D (which is easily seen not to be a vector space). 
Can we characterize what the restriction to rigid rotations means? This section addresses this 
question. 

Consider the restriction $L = R$ with $R^T R = I$. Then: 

$$\begin{cases} l_{11}^2 + l_{12}^2 + l_{13}^2 = 1 \\ l_{11} l_{21} + l_{12} l_{22} + l_{13} l_{23} = 0 \\ l_{21}^2 + l_{22}^2 + l_{23}^2 = 1 \end{cases}$$

The equations define a nonlinear subspace of the space $\xi = \{l_{11}, l_{12}, l_{13}\}$ isomorphic to $\mathcal{R}^3$, 
and of $\eta = \{l_{21}, l_{22}, l_{23}\}$, also isomorphic to $\mathcal{R}^3$. Of course, $\xi$ is a linear subspace of $\mathcal{R}^n$, the 
space of all views of the x coordinates of all objects. Rotations are the intersection of $\xi$ with 
the conics defined by the previous equations. 

The 2D views of one object defined by uniform affine transformations span $\{l_{11}, l_{12}, l_{13}\} = 
\mathcal{R}^3$. The 2D views of one object defined by rigid transformations, i.e., rotations, span a nonlinear 
subspace of $\mathcal{R}^3$, namely, the surface of the unit sphere in $\mathcal{R}^3$. All points on the unit sphere are 
allowed for $\{l_{11}, l_{12}, l_{13}\}$ (thus we “use up” two parameters). The triplet $(l_{21}, l_{22}, l_{23})$ is determined 
as a one-parameter family. Geometrically, once the vector $(l_{11}, l_{12}, l_{13})$ is fixed on the unit sphere, 
an orthogonal circle is determined on which the vector $(l_{21}, l_{22}, l_{23})$ must lie. 
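
The geometric picture (first row on the unit sphere, second row on a circle orthogonal to it) can be made concrete with a short sketch. The parametrization by an angle `t`, the choice of helper axis, and the function names are assumptions made for illustration.

```python
import numpy as np

def second_row(l1, t):
    """One-parameter family of unit vectors orthogonal to l1 (a circle on the unit sphere)."""
    l1 = l1 / np.linalg.norm(l1)
    helper = np.eye(3)[np.argmin(np.abs(l1))]   # coordinate axis least aligned with l1
    u = np.cross(l1, helper)
    u = u / np.linalg.norm(u)
    v = np.cross(l1, u)                         # unit vector, orthogonal to both l1 and u
    return np.cos(t) * u + np.sin(t) * v

def rotation_from_rows(l1, t):
    """Build a rotation whose first row is l1 and whose second row is picked by the angle t."""
    l1 = l1 / np.linalg.norm(l1)
    l2 = second_row(l1, t)
    l3 = np.cross(l1, l2)                       # third row is then fully determined
    return np.stack([l1, l2, l3])

R = rotation_from_rows(np.array([1.0, 2.0, 2.0]), t=0.7)
row1, row2 = R[0], R[1]
constraints = (row1 @ row1, row1 @ row2, row2 @ row2)   # should be (1, 0, 1)
```

Two parameters fix the first row on the sphere and one more fixes the second row on its circle, which accounts for the three degrees of freedom of a 3D rotation.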

6.5 Summary of the appendix 

The main point of this appendix can be summarized as a characterization of the algebraic 
structure (as a linear vector space) of the views of one object under orthographic projection. 

Consider the space $\mathcal{R}^{3n}$ of 3D views of all objects. Consider the subspace $V^{obj}_{3D}$ generated 
by one view of a specific object and by the action on it of the group of uniform transformations 
$\mathcal{L}$, that transform each vertex in the same way. $\mathcal{L}$ is an algebra of order 9, and therefore a 
linear vector space isomorphic to $\mathcal{R}^3 \times \mathcal{R}^3$. Thus, $V^{obj}_{3D}$ is a linear vector space isomorphic to 
$\mathcal{R}^9$. The projection operator (orthographic projection) that deletes the z components from 
the 3D views, maps $V^{obj}_{3D}$ into a linear vector subspace $V^{obj}_{2D}$, isomorphic to $\mathcal{R}^6$. $V^{obj}_{2D}$ consists 
of vectors with x and y components and can be written as the direct sum $V^{obj}_{2D} = V_x \oplus V_y$, 
where $V_x$ and $V_y$ are non-intersecting linear subspaces, each isomorphic to $\mathcal{R}^3$. In addition, 
we have proved that $V_x = V_y$, which implies that 1.5 snapshots are sufficient for “learning” 
an object (in general). If 3D translations are included, a linear subspace, isomorphic to $\mathcal{R}^2$, 
must be added to the linear space spanned by the 2D views of one object. The above is a short 
alternative proof of the main results of Ullman and Basri [29] (with the exception of the 1.5 
views result, see [22]). 